ACEEE Int. J. on Information Technology, Vol. 01, No. 03, Dec 2011 

Simple Data Compression by Differential Analysis 
using Bit Reduction and Number System Theory 

Debashish Chakraborty 1 , Sandipan Bera 2 , Anil Kumar Gupta 2 , Soujit Mondal 2 

Department of Computer Science & Engineering 1 and Information Technology 2 

St. Thomas' College of Engineering & Technology, 

Kolkata-23, West Bengal, India 

sunnydeba@gmail.com,sandipan.bera@gmail.com, anilgupta00749@gmail.com 



Abstract: This paper presents a simple compression algorithm 
based on number system theory and a file differential technique. 
Unlike most other techniques, it does not depend on the 
repetition or frequency of characters. The original file is 
broken into smaller sub files using the differential technique, 
and each sub file is then treated as a number in an n-base 
number system, where n is the number of distinct characters 
in that sub file. Compression is achieved by converting this 
n-base number to the binary number system. The main 
idea behind the algorithm is to save bits by converting 
from a higher-base number system to a lower one. Both 
compression and decompression are simple processes with 
low time complexity. 
I. Introduction 



Data compression or source coding is the process of 
encoding information using fewer bits (or other information 
bearing units) than a non-encoded representation would use 
through specific use of encoding schemes [1,2,3,5]. It follows 
that the receiver must be aware of the encoding scheme in 
order to decode the data to its original form. The compression 
schemes that are designed are basically trade-offs among 
the degree of data compression, the amount of distortion 
introduced and the resources (software and hardware) 
required to compress and decompress data. The primary 
reason behind doing so is to reduce the storage space 
required to save the data, or the bandwidth required to 
transmit it. Although storage technology has developed 
significantly over the past decade, the same cannot be said 
for transmission capacity. Data compression schemes may 
broadly be classified into (1) lossless compression and 
(2) lossy compression. Lossless compression algorithms 
usually exploit statistical redundancy in such a way as to 
represent the sender's data more concisely without error. 
Lossless compression is possible because most real-world 
data has statistical redundancy. Another kind of compression, 
called lossy data compression, is possible if some loss of 
fidelity is acceptable. It is important to note that with 
lossy compression the original data cannot be exactly 
reconstructed from the compressed data, because some parts 
of the data are rounded off or removed as redundant. 
These types of compression are also widely used in image 
compression [10,11,12]. The theoretical background of 
compression is provided by information theory and by rate 
distortion theory [6,7,8,9]. The algorithm proposed here is a 
new data compression algorithm. It depends little on the 
repetition of characters. Although its compression ratio is 
modest, the differential technique keeps the compression ratio 
nearly constant. The main advantage of this technique is that 
it can further compress the output file produced by applying 
other compression techniques to a file. The algorithm therefore 
gives good results when it is hybridized with other 
algorithms. 

II. Algorithm Strategy 

First, the file is broken into parts, or sub files. The 
differential condition is that the file is broken whenever the 
number of distinct characters read reaches n, where n is an 
integer that may vary from file to file. Since the total number 
of distinct characters is n, we treat the sub file as an n-base 
number system in which each character is one of the digits. 
The distinct characters present in the sequential file are 
therefore indexed from 0 to n-1. We then pick g characters 
at a time from the original file and, taking the index of each 
character as a digit, calculate the decimal value of the group 
according to number system theory. This decimal value is 
represented with s bits in binary format to produce the 
compressed file. The relation between g and s is established 
by mathematical analysis in the next section. Proper selection 
of the parameters g and s is necessary to obtain the best 
compression. 

III. Calculation 

Consider an n-base number system, where n can be any 
integer. The maximum digit of this number system is n-1. 
If we take g such digits, the maximum value they can 
represent is 

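$$\mathrm{Max\_val} = (n-1)\,n^{0} + (n-1)\,n^{1} + (n-1)\,n^{2} + \cdots + (n-1)\,n^{g-1}$$

$$= (n-1)\left[\frac{n^{g}-1}{n-1}\right] = n^{g}-1.$$

Similarly, with s bits the maximum binary value is 

$$2^{s}-1.$$

So, if we convert this n-base number system to the binary 
number system, we can say 

$$n^{g}-1 = 2^{s}-1 \;\Rightarrow\; n^{g} = 2^{s} \;\Rightarrow\; g\,\log(n) = s\,\log(2)$$

$$\Rightarrow\; \frac{s}{g} = \frac{\log(n)}{\log(2)} \;\Rightarrow\; \frac{s}{g} = 3.322\,\log(n)$$

where log denotes the base-10 logarithm, so that 1/log(2) is approximately 3.322. 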

Now, if we consider a file as a number system in which each 
character is a digit of this system, then g bytes 
can be replaced by s bits; that is, 8g bits are 
replaced by s bits. The compression percentage is therefore 

$$\%\,\mathrm{compression} = \left(1 - \frac{s}{8g}\right) \times 100\% = \left(1 - 0.41525\,\log(n)\right) \times 100\%$$

where n is the number of distinct characters in the file, log is 
the base-10 logarithm, and 0.41525 = 3.322/8. 
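
As a quick numerical check of the relations above, the following Python sketch computes s and the compression percentage for a given n and g. The function names are illustrative and do not come from the paper. `bits_paper` follows the rounded constant 3.322 used in the text (the guard for n = 1 is an added assumption), while `bits_exact` is an assumed alternative that finds the smallest s with 2^s >= n^g using integer arithmetic. The two can differ by one bit, for example when n is a power of 2, because 3.322 slightly overestimates 1/log(2).

```python
import math


def bits_paper(n: int, g: int) -> int:
    """s = ceil(3.322 * g * log10(n)), the rounded formula from the text.

    The max(1, ...) guard for n = 1 is an assumption, not part of the paper.
    """
    return max(1, math.ceil(3.322 * g * math.log10(n)))


def bits_exact(n: int, g: int) -> int:
    """Smallest s with 2**s >= n**g, computed without floating point."""
    return max(1, (n ** g - 1).bit_length())


def compression_percent(s: int, g: int) -> float:
    """Percentage saved when g 8-bit characters are stored in s bits."""
    return (1 - s / (8 * g)) * 100


if __name__ == "__main__":
    g = 4
    for n in (8, 20, 64):
        sp, se = bits_paper(n, g), bits_exact(n, g)
        print(n, sp, se, compression_percent(sp, g), compression_percent(se, g))
```

For n = 8 and g = 4 the rounded formula gives s = 13, whereas the integer version gives s = 12, since 8^4 - 1 = 4095 fits in 12 bits.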

IV. Algorithm 

Steps for Compression: 

1. Begin 

2. Take an empty one-dimensional array of size n. Let the 
array be store[ ]. 

3. Set store[0] = (space) 

4. Set count = 1 

5. Do until (end of file) 

6. Read the file and pick the next character from the file. 

7. If the character is already present in the array store[ ] then go 
to step 5, else go to step 8. 

8. Set store[count]=character 

9. count = count + 1 

10. If count = n then go to step 11, else go to step 5. 

11. Break the file, including the last character which is read. 

12. count = 1 

13. Go to step 5 

14. End of loop. 

15. Apply the steps from step 16 to step 27 on each sub 
file to obtain the compression. 

16. Find out the different types of characters in the file. 

17. Store the different characters in an array. 
Sequence [ ]= { sorted character set } . 

18. Count the no. of different characters. Now it will be n. 
So here we consider the file as a n-base number system. 

19. Select an integer g. 

20. s = ceiling value of (3.322*g*log(n)), where log is the base-10 logarithm. 

21. Do until (end of file) 

22. Pick up g no. of characters from the original file at a 
time. 

23. Find the position of each picked-up character within 
Sequence[ ] and store the positional values in another 
array pos[ ]. Let pos[ ] = $\{a_0, a_1, \ldots, a_{g-2}, a_{g-1}\}$. 

24. $\mathrm{pos\_val} = a_0\,n^{0} + a_1\,n^{1} + \cdots + a_{g-2}\,n^{g-2} + a_{g-1}\,n^{g-1}$. 

25. Now represent the pos_val with s no. of bits in binary 
system. Put these bits into a file Result. 

26. Go to step 21. 

27. End of loop 

28. Result file is the compressed file. 

29. End 
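
The compression steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the bit stream is kept as a string of '0' and '1' characters for readability, the array store[ ] is not preloaded with a space, and a final group shorter than g is encoded as if its missing high-order digits were zero, so a decoder would also need the sub file length.

```python
import math


def split_by_distinct(text: str, n: int) -> list[str]:
    """Cut text into sub files; each ends at the character that brings
    the number of distinct characters seen so far up to n."""
    parts, start, seen = [], 0, set()
    for i, ch in enumerate(text):
        seen.add(ch)
        if len(seen) == n:
            parts.append(text[start:i + 1])  # include the last character read
            start, seen = i + 1, set()
    if start < len(text):
        parts.append(text[start:])  # trailing sub file with fewer than n symbols
    return parts


def compress_subfile(sub: str, g: int) -> tuple[list[str], str]:
    """Return (Sequence, bit string) for one sub file."""
    sequence = sorted(set(sub))                    # sorted character set
    n = len(sequence)                              # base of the number system
    # s from the rounded formula in the text; the guard for n = 1 is an assumption.
    s = max(1, math.ceil(3.322 * g * math.log10(n)))
    position = {ch: i for i, ch in enumerate(sequence)}
    chunks = []
    for k in range(0, len(sub), g):
        group = sub[k:k + g]
        # pos_val = a_0*n^0 + a_1*n^1 + ... with the first character as a_0
        pos_val = sum(position[ch] * n ** j for j, ch in enumerate(group))
        chunks.append(format(pos_val, f"0{s}b"))   # fixed width of s bits
    return sequence, "".join(chunks)


def compress(text: str, n: int, g: int) -> list[tuple[list[str], str]]:
    """Compress every sub file and keep its character set with its bits."""
    return [compress_subfile(sub, g) for sub in split_by_distinct(text, n)]
```

Each sub file's character set is returned together with its bits because the decompression routine needs Sequence[ ] for every sub file.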

Steps for Decompression: 

1. Begin 

2. Set j = 0 

3. Sequence[ ] = {character set of the j-th sub file} 

4. Do until end of compressed string or File. 

5. Take s no. of bits at a time and then calculate its decimal 
value. 



6. Let the decimal value be val. 

7. Divide val by n repeatedly until g remainders are stored in an 
array rem[ ]. Let rem[ ] = $\{c_0, c_1, c_2, \ldots, c_{g-2}, c_{g-1}\}$. 

8. Set i = 0 and Flag = 0. 

9. While (i < g and Flag = 0) 

10. If rem[i] = Sequence[n-1] then go to step 11, else go to 
step 13. 

11. Set Flag = 1. 

12. Go to step 9. 

13. i = i + 1 

14. Go to step 9. 

15. End of while loop. 

16. If Flag = 1 then go to 

17. Now map each element of the array rem[ ] with the 
Sequence[ ] which is mentioned in the compression routine. 
Let rem[ ] = $\{c_0, c_1, c_2, \ldots, c_{g-2}, c_{g-1}\}$. Then put 
Sequence[$c_0$], Sequence[$c_1$], ..., Sequence[$c_{g-1}$] into a file Extract. 

18. Go to step 4. 

19. Now map the 0th to i-th elements of the array rem[ ] with the 
Sequence[ ] which is mentioned in the compression routine. So, 
put Sequence[$c_0$], Sequence[$c_1$], ..., Sequence[$c_i$] in the file Extract. 

20. Set j = j + 1 

21. Sequence[ ] = {character set of the j-th sub file} 

22. Go to Step 4. 

23. End of Loop. 

24. Extract will be the decompressed File. 

25. End 
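
The core of the decompression routine, steps 5 to 7 and the mapping back through Sequence[ ], can be sketched in Python as below. This is a simplified illustration rather than the procedure exactly as listed: instead of the Flag test used in the steps to detect the end of a sub file, the sketch assumes the original length of the sub file is known and passed in as the hypothetical parameter `length`. The function name is also an assumption.

```python
import math


def decompress_subfile(sequence: list[str], bits: str, g: int, length: int) -> str:
    """Decode one sub file from a string of '0'/'1' characters.

    sequence -- sorted character set of the sub file (Sequence[ ])
    length   -- number of characters in the original sub file (assumed known)
    """
    n = len(sequence)
    # Must match the encoder: s from the rounded formula, guarded for n = 1.
    s = max(1, math.ceil(3.322 * g * math.log10(n)))
    out = []
    for k in range(0, len(bits), s):
        val = int(bits[k:k + s], 2)        # decimal value of s bits
        for _ in range(g):
            val, rem = divmod(val, n)      # remainders come out as c_0, c_1, ...
            out.append(sequence[rem])
    return "".join(out[:length])           # drop padding from a short last group
```

Because repeated division yields the least significant digit first, the remainders are already in the order c_0, c_1, ..., c_{g-1} and need no reversal here.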

V. Algorithm illustration 

For example, consider the text 'Bob is a boy'. The total number of 
characters is 12, but the character set is {B, o, b, i, s, a, y, space}. The 
characters can be indexed as {B=0, o=1, b=2, i=3, s=4, a=5, y=6, and 7 for 
space}. Since the number of characters in the character set is 8, 
n=8; take g=4. So s = 3.322*(log 8)*4, which is approximately 
12.0003. Taking the ceiling value we get s=13. Now take 4 
characters from the original string at a time and calculate the 
decimal value of their indices. The first 4 characters are {B, o, b, space}, 
their index positions are {0, 1, 2, 7}, and the decimal 
value is calculated as 0*8^0 + 1*8^1 + 2*8^2 + 7*8^3 = 3720. Next, 
convert 3720 into binary to get a 13-bit value, (0111010001000), 
and put it in the file. Represented in this way, the total size 
of the compressed file is 13*3 = 39 bits, whereas the original 
size of the text was 8*12 = 96 bits. In the decompression 
process, the compressed bit string is converted back into its decimal 
value: (0111010001000)_2 = (3720)_10. Converting 
3720 to the n=8 base number system gives 
{7, 2, 1, 0}, which is reversed to {0, 1, 2, 7}. After mapping with the 
index, we get back the original string {B, o, b, space}. The rest of the 
file is decompressed in the same way. 
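
The arithmetic of this illustration can be reproduced with a short Python sketch. It assumes the index assignment given in the example (order of first appearance, with space as 7) and the values n=8, g=4, s=13; the names `INDEX`, `encode_group` and `decode_group` are illustrative only.

```python
# Index assignment taken from the example: B=0, o=1, b=2, i=3, s=4, a=5, y=6, space=7
INDEX = {"B": 0, "o": 1, "b": 2, "i": 3, "s": 4, "a": 5, "y": 6, " ": 7}
N, G, S = 8, 4, 13  # base, group size, bits per group


def encode_group(group: str) -> tuple[int, str]:
    """Return the decimal value of a group and its S-bit binary string."""
    pos_val = sum(INDEX[ch] * N ** j for j, ch in enumerate(group))
    return pos_val, format(pos_val, f"0{S}b")


def decode_group(bit_string: str) -> str:
    """Recover G characters from one S-bit binary string."""
    symbols = {i: ch for ch, i in INDEX.items()}
    val = int(bit_string, 2)
    digits = []
    for _ in range(G):
        val, rem = divmod(val, N)  # least significant digit first
        digits.append(rem)
    return "".join(symbols[d] for d in digits)


text = "Bob is a boy"
groups = [text[k:k + G] for k in range(0, len(text), G)]
encoded = [encode_group(grp) for grp in groups]
compressed = "".join(bits for _, bits in encoded)

print(encoded[0])       # (3720, '0111010001000')
print(len(compressed))  # 39
print("".join(decode_group(compressed[k:k + S])
              for k in range(0, len(compressed), S)))  # Bob is a boy
```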
VI. Algorithm Discussion 

Fig. 1 Number of distinct characters in a file vs. 
compression gain without the differential technique. 

Fig. 2 Number of distinct characters in a file vs. 
compression gain with the differential technique (differential 
condition: n=20). 

Table I 
(? marks an entry that is not legible in the source.) 

FILE           ORIGINAL SIZE   HUFFMAN   LZW       PROPOSED ALGORITHM 
Sample1.txt    2.46KB          ?         ?         ? 
Sample2.txt    ?               ?         44.18%    44.71% 
Sample3.?      ?               ?         ?         44.56% 
Sample4.java   4.12KB          ?         ?         ? 
Sample5.?      4.43KB          ?         ?         ? 
Sample6.?      ?               35.53%    45.12%    ? 
Sample7.cpp    ?               ?         ?         ? 
Sample8.cpp    ?               ?         ?         ? 
Sample9.cpp    ?               36.35%    ?         ? 
Sample10.?     ?               ?         ?         44.76% 
Sample11.?     5.24KB          35.12%    ?         ? 
Sample12.txt   ?               34.13%    ?         ? 


Fig. 3 Compression Gain of proposed algorithm compared with 
existing algorithms 
A. Character Set Size and Compression gain 

The graph above (Fig. 1) shows how the 
compression gain varies with the size of the character set when the 
differential technique is not applied, i.e. the file is not broken. 
Compression is better when the character set is 
smaller. In this respect the algorithm resembles other well-known 
statistical data compression techniques such as the Huffman 
algorithm. 

B. Differential Technique and Compression gain 

The table and graph above (Fig. 2) show that applying 
the differential technique yields a nearly constant 
compression gain. The choice of the number of distinct characters 
(the variable 'n' in the compression routine) on which the breaking 
of the file depends is very important. This number should be a 
trade-off between the compression gain and the number of file breakings. 
If the number is very small, the number of file breakings 
increases, which greatly increases the complexity, although 
it also improves the compression gain. Conversely, if the selected 
number is very large, the number of file breakings 
decreases, which reduces the complexity but also 
degrades the compression gain. The 
number should therefore be chosen so that the 
algorithm achieves a moderate compression gain at a moderate 
complexity. 
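
The trade-off described above can be explored with a small Python sketch that counts how many sub files a given threshold n produces and evaluates the gain formula from Section III. It is an illustration only: the function names and the sample text are assumptions, and the gain formula ignores both the ceiling applied to s and the cost of storing each sub file's character set.

```python
import math


def count_subfiles(text: str, n: int) -> int:
    """Number of sub files produced when the file is broken each time
    n distinct characters have been read."""
    count, seen, pending = 0, set(), False
    for ch in text:
        seen.add(ch)
        pending = True
        if len(seen) == n:
            count += 1                  # break after the current character
            seen, pending = set(), False
    return count + (1 if pending else 0)  # trailing partial sub file, if any


def theoretical_gain(n: int) -> float:
    """(1 - 0.41525 * log10(n)) * 100, the gain formula from Section III."""
    return (1 - 0.41525 * math.log10(n)) * 100


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 20  # hypothetical input
    for n in (5, 10, 20):
        print(n, count_subfiles(sample, n), round(theoretical_gain(n), 2))
```

A smaller n gives a higher theoretical gain but more sub files, each of which needs its own character set and its own pass through the encoding steps.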

VII. Conclusions 

In this paper a new data compression algorithm is 
introduced. The distinguishing feature of this algorithm is its 
simplicity. An entirely different technique, that of 'saving bits', 
is employed to reduce the size of text files. Since every 
character is handled individually through its index, the output 
codes do not depend on repetition, unlike most other 
compression algorithms. 
Different combinations of characters can be represented by 
fewer bits. After the code formation, ASCII characters 
replace the binary numbers, which finally reduces the file 
size. The compression algorithm takes O(n) time, where n is 
the total number of characters in the file. Since the differential 
breaking follows a divide-and-conquer policy, it takes O(n log n) 
time. So, the total computation time required for this algorithm 
is O(n log n). To the best of the authors' findings, 
no existing data compression algorithm 
places emphasis on 
differential compression based on number theory and bit 
reduction. 